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We consider the problem of combining a (possibly uncountably 
infinite) set of afRne estimators in nonparametric regression model 
with heteroscedastic Gaussian noise. Focusing on the exponentially 
weighted aggregate, we prove a PAC-Bayesian type inequality that 
leads to sharp oracle inequalities in discrete but also in continuous 
settings. The framework is general enough to cover the combinations 
of various procedures such as least square regression, kernel ridge re- 
gression, shrinking estimators and many other estimators used in the 
literature on statistical inverse problems. As a consequence, we show 
that the proposed aggregate provides an adaptive estimator in the 
exact minimax sense without discretizing the range of tuning param- 
eters or splitting the set of observations. We also illustrate numeri- 
cally the good performance achieved by the exponentially weighted 
te. 



1. Introduction. There is growing empirical evidence of superiority of 
aggregated statistical procedures, also referred to as blending, stacked gener- 
alization or ensemble methods, with respect to "pure" ones. Since their intro- 
duction in the 1990s, famous aggregation procedures such as Boosting [30], 
Bagging [7] or Random Forest [2] have been successfully used in practice 
for a large variety of applications. Moreover, most recent Machine Learning 
competitions such as the Pascal VOC or Netflix challenge have been won 
by procedures combining different types of classifiers/predictors/estimators. 
It is therefore of central interest to understand from a theoretical point of 
view what kind of aggregation strategies should be used for getting the best 
possible combination of the available statistical procedures. 
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1.1. Historical remarks and motivation. In the statistical literature, to 
the best of our knowledge, theoretical foundations of aggregation proce- 
dures were first studied by Nemirovski (Nemirovski [48], Juditsky and Ne- 
mirovski [37]) and independently by a series of papers by Catoni (see [11] 
for an account) and Yang [63-65]. For the regression model, a significant 
progress was achieved by Tsybakov [60] with introducing the notion of opti- 
mal rates of aggregation and proposing aggregation-rate-optimal procedures 
for the tasks of linear, convex and model selection aggregation. This point 
was further developed in [9, 46, 53], especially in the context of high dimen- 
sion with sparsity constraints and in [51] for Kullback-Leibler aggregation. 
However, it should be noted that the procedures proposed in [60] that prov- 
ably achieve the lower bounds in convex and linear aggregation require full 
knowledge of design distribution. This limitation was overcome in the recent 
work [62]. 

From a practical point of view, an important limitation of the previously 
cited results on aggregation is that they are valid under the assumption that 
the aggregated procedures are deterministic (or random, but independent 
of the data used for aggregation). The generality of those results — almost 
no restriction on the constituent estimators — compensates to this practical 
limitation. 

In the Gaussian sequence model, a breakthrough was reached by Leung 
and Barron [45] . Building on very elegant but not very well-known results by 
George [32]^, they established sharp oracle inequalities for the exponentially 
weighted aggregate (EWA) for constituent estimators obtained from the data 
vector by orthogonally projecting it on some linear subspaces. Dalalyan and 
Tsybakov [21, 22] showed the result of [45] remains valid under more general 
(non-Gaussian) noise distributions and when the constituent estimators are 
independent of the data used for the aggregation. A natural question arises 
whether a similar result can be proved for a larger family of constituent es- 
timators containing projection estimators and deterministic ones as specific 
examples. The main aim of the present paper is to answer this question by 
considering families of affine estimators. 

Our interest in affine estimators is motivated by several reasons. First, 
affine estimators encompass many popular estimators such as smoothing 
splines, the Pinsker estimator [28, 49], local polynomial estimators, non- 
local means [8, 56], etc. For instance, it is known that if the underlying 
(unobserved) signal belongs to a Sobolev ball, then the (linear) Pinsker esti- 
mator is asymptotically minimax up to the optimal constant, while the best 



^Corollary 2 in [32] coincides with Theorem 1 in [45] in the case of exponential weights 
with temperature /? — 2a^; cf. equation (2.2) below for a precise definition of exponential 
weights. Furthermore, to the best of our knowledge, [32] is the first reference using the 
Stein lemma for evaluating the expected risk of the exponentially weighted aggregate. 
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projection estimator is only rate-minimax. A second motivation is that — as 
proved by Juditsky and Nemirovski [38] — the set of signals that are well 
estimated by linear estimators is very rich. It contains, for instance, sam- 
pled smooth functions, sampled modulated smooth functions and sampled 
harmonic functions. One can add to this set the family of piecewise con- 
stant functions as well, as demonstrated in [50], with natural application in 
magnetic resonance imaging. It is worth noting that oracle inequalities for 
penalized empirical risk minimizer were also proved by Golubev [36], and 
for model selection by Arlot and Bach [3], Baraud, Giraud and Huet [5]. 

In the present work, we establish sharp oracle inequalities in the model 
of heteroscedastic regression, under various conditions on the constituent 
estimators assumed to be affine functions of the data. Our results provide 
theoretical guarantees of optimality, in terms of expected loss, for the ex- 
ponentially weighted aggregate. They have the advantage of covering in a 
unified fashion the particular cases of frozen estimators considered in [22] 
and of projection estimators treated in [45]. 

We focus on the theoretical guarantees expressed in terms of oracle in- 
equalities for the expected squared loss. Interestingly, although several recent 
papers [3, 5, 35] discuss the paradigm of competing against the best linear 
procedure from a given family, none of them provide oracle inequalities with 
leading constant equal to one. Furthermore, most existing results involve 
some constants depending on different parameters of the setup. In contrast, 
the oracle inequality that we prove herein is with leading constant one and 
admits a simple formulation. It is established for (suitably symmetrized, if 
necessary) exponentially weighted aggregates [11, 21, 32] with an arbitrary 
prior and a temperature parameter which is not too small. The result is 
nonasymptotic but leads to an asymptotically optimal residual term when 
the sample size, as well as the cardinality of the family of constituent esti- 
mators, tends to infinity. In its general form, the residual term is similar to 
those obtained in the PAC-Bayes setting [42, 47, 57] in that it is proportional 
to the Kullback-Leibler divergence between two probability distributions. 

The problem of competing against the best procedure in a given family 
was extensively studied in the context of online learning and prediction with 
expert advice [16, 39]. A connection between the results on online learning 
and statistical oracle inequalities was established by Gerchinovitz [33]. 

1.2. Notation and examples of linear estimators. Throughout this work, 
we focus on the heteroscedastic regression model with Gaussian additive 
noise. We assume we are given a vector Y = (yi, . . . , yn)~^ S M" obeying the 
model 



(1.1) 



yi = fi + ii for i = l,...,n. 
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where ^ = [Ci, ■ ■ ■ ,Cn) is a centered Gaussian random vector, /j = i{xi) 
where f : — )■ M is an unknown function and xi, . . . ,Xn £ X are determinis- 
tic points. Here, no assumption is made on the set X. Our objective is to 
recover the vector f = (/i, . . . , fn)~^, often referred to as signal, based on the 
data yi, . . . , yn- In our work, the noise covariance matrix S = E[^^^] is as- 
sumed to be finite with a known upper bound on its spectral norm |||S|||. We 
denote by (•|-)n the empirical inner product in R": (u|v)„ = (1/n) UiVi. 
We measure the performance of an estimator f by its expected empirical 
quadratic loss: r = E[||f - f ||2] where ||f - f ||2 = 1 Y:tlif^ " fi?- 

We only focus on the task of aggregating affine estimators ix indexed by 
some parameter A G A. These estimators can be written as affine transforms 
of the data Y = (yi, . . . , yn)^ £ I^""- Using the convention that all vectors are 
one-column matrices, we have = AxY + b;^, where the n x n real matrix 
Ax and the vector G are deterministic. It means the entries of Ax 
and hx may depend on the points xi,...,Xn but not on the data Y. Let 
us describe now different families of linear and affine estimators successfully 
used in the statistical literature. Our results apply to all these families, 
leading to a procedure that behaves nearly as well as the best (unknown) 
one of the family. 

Ordinary least squares. Let {5a : A G A} be a set of linear subspaces of M". 
A well-known family of affine estimators, successfully used in the context 
of model selection [6], is the set of orthogonal projections onto 5a. In the 
case of a family of linear regression models with design matrices Xx, one 
has ^A = ^a(-^a'^^)^"'^a'' where {XjXx)~^ stands for the Moore-Penrose 
pseudo-inverse of Xj Xx- 

Diagonal filters. Other common estimators are the so-called diagonal fil- 
ters corresponding to diagonal matrices A = diag(ai, . . . ,an). Examples in- 
clude the following: 

• Ordered projections: = l(fc<A) for some integer A is the indicator 
function]. Those weights are also called truncated SVD (Singular Value 
Decomposition) or spectral cutoff. In this case a natural parametrization 
is A = {1, . . . , n}, indexing the number of elements conserved. 

• Block projections: = ^k<wi) + Y.TJi ^j'^iw,<k<wj+i), k = l,...,n, where 
Aj G {0,1}. Here the natural parametrization is A = {0,1}"*"^, indexing 
subsets of {1, . . . , m — 1}. 

• Tikhonov-Philipps filter: Ofc = ; where w,a> 0. In this case, A = 

(M^)^, indexing continuously the smoothing parameters. 

• Pinsker filter: a/; = (1 — — )+, where x^ = max(j;, 0) and {w, a) = A G A = 

(m;)2. 

Kernel ridge regression. Assume that we have a positive definite kernel 
k:X X X ^M. and we aim at estimating the true function / in the associated 



AGGREGATION OF AFFINE ESTIMATORS 



5 



reproducing kernel Hilbert space {Tik, \\ ■ \\k)- The kernel ridge estimator 
is obtained by minimizing the criterion ||Y — f ||,^ + A||f ||| w.r.t. / G Hk 
(see [58], page 118). Denoting by K the n x n kernel-matrix with element 
unique solution f is a linear estimate of the data, f = 
A\y, with Ax = K{K + nXInxn)~^ , where Inxn is the nxn identity matrix. 

Multiple Kernel learning. As described in [3], it is possible to handle the 
case of several kernels ki, . . . ,kM , with associated positive definite matrices 
Ki, . . . , Km- For a parameter A = (Ai, . . . , Am) G A = R^^, one can define the 
estimators = AxY with 



It is worth mentioning that the formulation in equation (1.2) can be linked 
to the group Lasso [66] and to the multiple kernel learning introduced in 
[41] — see [3] for more details. 

Moving averages. If we think of coordinates of f as some values assigned to 
the vertices of an undirected graph, satisfying the property that two nodes 
are connected if the corresponding values of f are close, then it is natural to 
estimate fi by averaging out the values Yj for indices j that are connected 
to i. The resulting estimator is a linear one with a matrix A = {aij)^^^-^ such 
that aij = ly. (j)/nj, where Vi is the set of neighbors of the node i in the 
graph and Tij is the cardinality of Vi. 

1.3. Organization of the paper. In Section 2 we introduce EWA and state 
a PAC-Bayes type bound in expectation assessing optimality properties of 
EWA in combining affine estimators. The strengths and limitations of the 
results are discussed in Section 3. The extension of these results to the case 
of grouped aggregation — in relation with ill-posed inverse problems — is de- 
veloped in Section 4. As a consequence, we provide in Section 5 sharp oracle 
inequalities in various setups: ranging from finite to continuous families of 
constituent estimators and including sparse scenarii. In Section 6 we apply 
our main results to prove that combining Pinsker's type filters with EWA 
leads to asymptotically sharp adaptive procedures over Sobolev ellipsoids. 
Section 7 is devoted to numerical comparison of EWA with other classical 
filters (soft thresholding, blockwise shrinking, etc.) and illustrates the po- 
tential benefits of aggregating. The conclusion is given in Section 8, while 
the proofs of some technical results (Propositions 2-6) are provided in the 
supplementary material [20]. 

2. Aggregation of estimators: Main results. In this section we describe 
the statistical framework for aggregating estimators and we introduce the 
exponentially weighted aggregate. The task of aggregation consists in esti- 



(1.2) 
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mating f by a suitable combination of the elements of a family of constituent 
estimators F\ = (fA)AGA ^ The target objective of the aggregation is to 
build an aggregate faggr that mimics the performance of the best constituent 
estimator, called oracle (because of its dependence on the unknown func- 
tion f). In what follows, we assume that A is a measurable subset of M^, 
for some M G N. 

The theoretical tool commonly used for evaluating the quality of an ag- 
gregation procedure is the oracle inequality (01), generally written 

(2.1) E[||fagg, - f ||2] < Cn inf E[||fA - nl] + Rn, 

A£ A. 

with residual term tending to zero as n— )• oo, and leading constant Cn 
being bounded. The OIs with leading constant one are of central theoret- 
ical interest since they allow to bound the excess risk and to assess the 
aggregation-rate-optimality. They are often referred to as sharp 01. 

2.1. Exponentially weighted aggregate (EWA). Let r\ =E[||fA — f||n] de- 
note the risk of the estimator f^, for any A G A, and let fx be an estimator of 
rx. The precise form of fx strongly depends on the nature of the constituent 
estimators. For any probability distribution vr over A and for any /3 > 0, we 
define the probability measure of exponential weights, vr, by 

(2.2) 7r(dA) = 0(A)7r(dA) with 0(A) - exp(-nrA//3) 



/^exp(-nf(^//3)7r(da;) ' 

The corresponding exponentially weighted aggregate, henceforth denoted by 
FewAi is the expectation of w.r.t. the probability measure vr: 

(2.3) fEWA = / f A T^{d\). 

J A 

We will frequently use the terminology of Bayesian statistics: the measure 
vr is called prior, the measure vr is called posterior and the aggregate fswA 
is then the posterior mean. The parameter /3 will be referred to as the tem- 
perature parameter. In the framework of aggregating statistical procedures, 
the use of such an aggregate can be traced back to George [32]. 

The interpretation of the weights 0{X) is simple: they up-weight estima- 
tors all the more that their performance, measured in terms of the risk 
estimate fx, is good. The temperature parameter reflects the confidence we 
have in this criterion: if the temperature is small (/3 w 0), the distribution 
concentrates on the estimators achieving the smallest value for fx , assigning 
almost zero weights to the other estimators. On the other hand, if /3 — )• -|-oo, 
then the probability distribution over A is simply the prior vr, and the data 
do not influence our confidence in the estimators. 
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2.2. Main results. In this paper we only focus on affine estimators 

(2.4) ix = AxY + hx, 

where the n x n real matrix Ax and the vector hx G are deterministic. 
Furthermore, we will assume that an unbiased estimator S of the noise 
covariance matrix S is available. It is well known (cf. Appendix for details) 
that the risk of the estimator (2.4) is given by 

(2.5) rx = E[\\{x - = II (Ax - Inxn){ + hxWl + ^'^^f^^"^ 
and that f""^^, defined by 

(2.6) = ||Y - hWl + - T^i^Ax) - - Tr[S], 

n n 

is an unbiased estimator of rx- Along with f^^, we will use another estimator 
of the risk that we call the adjusted risk estimate and define by 

(2.7) ff = II Y - hWl + - Tr(SAA) - - Tr[S] +-Y^(^a - ^!)Y. 

n n n 



^unb 

'a 



One can notice that the adjusted risk estimate r^^ coincides with the un 



biased risk estimate if and only if the matrix Ax is an orthogonal 



projector. 

To state our main results, we denote by "Pa the set of all probability 
measures on A and by IC{p,p') the Kullback-Leibler divergence between two 
probability measures p,p' £ V\: 



IC{p,p'[ 




f dp \ 

logi-^{X)\p{dX), if p is absolutely continuous w.r.t. p' , 



otherwise. 



We write Si ^ S2 (resp., 5i ^ ^2) for two symmetric matrices Si and 5*2, 
when S2 — Si (resp., Si — S2) is semi-definite positive. 

Theorem 1. Let all the matrices Ax be symmetric and S be unbiased 
and independent ofy. 

(i) Assume that for all A, A' G A, it holds that AxAx' = Ay Ax, AxT, + 
^Ax ^ and = 0. If f3 > 8|||S|||, then the aggregate fswA defined by equa- 
tions (2.2), (2.3) and the unbiased risk estimate fx = f^"^ (2-6) satisfies 

(2.8) E[||fEWA-f||^]< inf j / n\h-nl]p{dX) + ^lC{P,^)\- 
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(ii) Assume that, for all A G A, A\ ^ Inxn oind A\h\ = 0. If /3 > 4|||S|||, 
then the aggregate Tewa defined by equations (2.2), (2.3) and the adjusted 
risk estimate rx = f^^^ (2-7) satisfies 

E[||fEWA-f||^]< inf I / E[\\h-f\\l]p{dX) + ^IC{p,n) 

+ - [ {{-"iAx-ADi + TTimx-ADMdX)]. 

The simplest setting in which all the conditions of part (i) of Theorem 1 
are fulfilled is when the matrices Ax and S are all diagonal, or diagonaliz- 
able in a common base. This result, as we will see in Section 6, leads to a 
new estimator which is adaptive, in the exact minimax sense, over the col- 
lection of all Sobolev ellipsoids. It also suggests a new method for efficiently 
combining varying-block-shrinkage estimators, as described in Section 5.4. 

However, part (i) of Theorem 1 leaves open the issue of aggregating affine 
estimators defined via noncommuting matrices. In particular, it does not 
allow us to evaluate the MSE of EWA when each Ax is a convex or linear 
combination of a fixed family of projection matrices on nonorthogonal linear 
subspaces. These kinds of situations may be handled via the result of part 
(ii) of Theorem 1. One can observe that in the particular case of a finite 
collection of projection estimators (i.e., Ax = A'j^ and = for every A), 
the result of part (ii) offers an extension of [45], Corollary 6, to the case of 
general noise covariances ([45] deals only with i.i.d. noise). 

An important situation covered by part (ii) of Theorem 1, but not by 
part (i), concerns the case when signals of interest f are smooth or sparse 
in a basis i3sig which is different from the basis iSnoisc orthogonalizing the 
covariance matrix S. In such a context, one may be interested in considering 
matrices Ax that are diagonalizable in the basis Bsig which, in general, do 
not commute with S. 

Remark 1. While the results in [45] yield a sharp oracle inequality in 
the case of projection matrices Ax, they are of no help in the case when 
the matrices Ax are nearly idempotent and not exactly. Assertion (ii) of 
Theorem 1 fills this gap by showing that if max;)^Tr[A;^ — Aj^] < 5, then 

]E[||fEWA — f lln] is bounded by 

inf j / E[||f,-f||2]p(dA) + ^/C(p,vr)|+<5(||f||2+n-i|||E|||). 

Remark 2. We have focused only on Gaussian errors to emphasize that 
it is possible to efficiently aggregate almost any family of affine estimators. 
We believe that by a suitable adaptation of the approach developed in [22], 
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claims of Theorem 1 can be generalized — at least when are independent 
with known variances — to some other common noise distributions. 

The results presented so far concern the situation when the matrices A\ 
are symmetric. However, using the last part of Theorem 1, it is possible to 
propose an estimator of f that is almost as accurate as the best affine estima- 
tor AxY + b;^ even if the matrices A\ are not symmetric. Interestingly, the 
estimator enjoying this property is not obtained by aggregating the original 
estimators fx = AxY + hx but the "symmetrized" estimators ix = AxY + hx, 
where Ax = Ax + A^ — Aj^Ax- Besides symmetry, an advantage of the ma- 
trices Ax, as compared to the Ax's, is that they automatically satisfy the 
contraction condition Ax '^Inxn required by part (ii) of Theorem 1. We will 
refer to this method as Symmetrized Exponentially Weighted Aggregates 
(or SEWA) [19]. 

Theorem 2. Assume that the matrices Ax and the vectors hx satisfy 
Axhx = A^l^hx = for every A G A. Assume in addition that S is an un- 
biased estimator of S and is independent of Y. Let fsEWA denote the ex- 
ponentially weighted aggregate of the (symmetrized) estimators fx = {Ax + 
A^ — AlJ^ Ax)Y + hx with the weights (2.2) defined via the risk estimate f^^. 
Then, under the conditions (3 > 4|||5]||| and 



To understand the scope of condition (C), let us present several cases of 
widely used linear estimators for which this condition is satisfied: 

• The simplest class of matrices Ax for which condition (C) holds true are 
orthogonal projections. Indeed, if ^4^ is a projection matrix, it satisfies 
AjAx = Ax and, therefore, Tr(S^A) = Ti{%A~{Ax). 

• When the matrix S is diagonal, then a sufficient condition for (C) is 
la < Sj=i ^ji- Consequently, (C) holds true for matrices having only zeros 
on the main diagonal. For instance, the fcNN filter in which the weight 
of the observation Yi is replaced by zero, that is, Ojj = ljG{ji fc}/^ 
satisfies this condition. 

• Under a little bit more stringent assumption of homoscedasticity, that is, 
when S — (7 Inxnt if tliG majtricGS A\ cire such. th.a,t cill tliG nonzGro GlGiiiGiits 
of Gach row are equal and sum up to one (or a quantity larger than one), 



(C) 




a.s. 



it holds that 



(2.9) 
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then Tv{Ax) = Tt:{AJ Ax) and (C) is fulfilled. A notable example of linear 
estimators that satisfy this condition are Nadaraya-Watson estimators 
with rectangular kernel and nearest neighbor filters. 

3. Discussion. Before elaborating on the main results stated in the previ- 
ous section, by extending them to inverse problems and by deriving adaptive 
procedures, let us discuss some aspects of the presented OIs. 

3.1. Assumptions on S. In some rare situations, the matrix S is known 
and it is natural to use S = S as an unbiased estimator. Besides this not very 
realistic situation, there are at least two contexts in which it is reasonable 
to assume that an unbiased estimator of S, independent of Y, is available. 

The first case corresponds to problems in which a signal can be recorded 
several times by the same device, or once but by several identical devices. 
For instance, this is the case when an object is photographed many times by 
the same digital camera during a short time period. Let Zi, . . . , Z^v be the 
available signals, which can be considered as i.i.d. copies of an n-dimensional 
Gaussian vector with mean f and covariance matrix E^. Then, defining Y = 
(Zi + ... + Z^)/iV and = {N - l)-\ZiZj + ■ ■ ■ + ZnZI - NYY^), we 
find ourselves within the framework covered by previous theorems. Indeed, 
Y ~ A/'n(f, Sy) with Sy = T,z/N and Ey = T,z/N is an unbiased estimate 
of Sy, independent of Y. Note that our theory applies in this setting for 
every integer N >2. 

The second case is when the dominating part of the noise comes from the 
device which is used for recording the signal. In this case, the practitioner 
can use the device in order to record a known signal, g. In digital image 
processing, g can be a black picture. This will provide a noisy signal Z 
drawn from Gaussian distribution Mnig,^), independent of Y which is the 
signal of interest. Setting S = (Z — g)(Z — g)"*^, one ends up with an unbiased 
estimator of S, which is independent of Y. 

3.2. 01 in expectation versus 01 with high probability. All the results 
stated in this work provide sharp nonasymptotic bounds on the expected 
risk of EWA. It would be insightful to complement this study by risk bounds 
that hold true with high probability. However, it was recently proved in [17] 
that EWA is deviation suboptimal: there exist a family of constituent es- 
timators and a constant C > such that the difference between the risk 
of EWA and that of the best constituent estimator is larger than C j \/n 
with probability at least 0.06. Nevertheless, several empirical studies (see, 
e.g., [18]) demonstrated that EWA has often a smaller risk than some of its 
competitors, such as the empirical star procedure [4], which are provably 
optimal in the sense of OIs with high probability. Furthermore, numerical 
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experiments carried out in Section 7 show that the standard-deviation of 
the risk of EWA is of the order of 1/n. This suggests that under some condi- 
tions on the constituent estimators it might be possible to estabhsh OIs for 
EWA that are similar to (2.8) but hold true with high probability. A step 
in proving this kind of result was done in [43], Theorem C, for the model of 
regression with random design. 

3.3. Relation to previous work and limits of our results. The 01 of the 
previous section requires various conditions on the constituent estimators 
fx = A\Y + b;^. One may wonder how general these conditions are and is it 
possible to extend these OIs to more general Fa's. Although this work does 
not answer this question, we can sketch some elements of response. 

First of all, we stress that the conditions of the present paper relax signifi- 
cantly those of previous results existing in statistical literature. For instance, 
Kneip [40] considered only linear estimators, that is, = and, more im- 
portantly, only ordered sets of commuting matrices ^a- The ordering as- 
sumption is dropped in Leung and Barron [45], in the case of projection 
matrices. Note that neither of these assumptions is satisfied for the families 
of Pinsker and Tikhonov-Philipps estimators. The present work strength- 
ens existing results in considering more general, affine estimators extending 
both projection matrices and ordered commuting matrices. 

Despite the advances achieved in this work, there are still interesting cases 
that are not covered by our theory. We now introduce a family of estimators 
commonly used in image processing that do not satisfy our assumptions. In 
recent years, nonlocal means (NLM) became quite popular in image process- 
ing [8] . This method of signal denoising, shown to be tied in with EWA [56] , 
removes noise by exploiting signals self-similarities. We briefly define the 
NLM procedure in the case of one-dimensional signals. 

Assume that a vector Y = {yi, . . . ,yn)~^ given by (1.1) is observed with 
fi = f{i/n), i = 1, . . . , n, for some function f : [0, 1] — >■ M. For a fixed "patch- 
size" k e let us define f[j] = {fi, fi+i,...,fi+k-iV and Y^jj = 
(?/j, yi+i, . . . , Vi+k-iV for every i = 1, . . . ,n - k + 1. The vectors %] and Y[j] 
are, respectively, called true patch and noisy patch. The NLM consists in 
regarding the noisy patches Y[j] as constituent estimators for estimating the 
true patch f^j^] by applying EWA. One easily checks that the constituent 
estimators Y[j] are affine in Y[jy], that is, Yjjj = ^jYjjgj + bj with Ai and bj 
independent of Yj^q] . Indeed, if the distance between i and zq is larger than 
k, then Y^^ is independent of Y^^^^ and, therefore, Ai = and bj = Y^jj. If 
|i — ioj < k, then the matrix Ai is a suitably chosen shift matrix and bj is the 
projection of Y[j] onto the orthogonal complement of the image of Ai. Un- 
fortunately, these matrices {Ai} and vectors {6j} do not fit our framework, 
that is, the assumption Aibi = Ajbi = is not satisfied. 
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Finally, our proof technique is specific to affine estimators. Its extension 
to estimators defined as a more complex function of the data will certainly 
require additional tools and is a challenging problem for future research. Yet, 
it seems unlikely to get sharp OIs with optimal remainder term for a fairly 
general family of constituent estimators (without data-splitting), since this 
generality inherently increases the risk of overfitting. 

4. Ill-posed inverse problems and group- weighting. As explained in [12, 
13], the model of heteroscedastic regression is well suited for describing in- 
verse problems. In fact, let T be a known linear operator on some Hilbert 
space with inner product For some h^H, let Y be the random 

process indexed hy g gV. such that 

(4.1) Y = Th + eC ^ {Y{g) = {Th\g)n + e({g),ygen), 

where e > is the noise magnitude and is a white Gaussian noise on 
Ti, that is, for any gi, ■ ■ ■ , gk ^ the vector (Y{gi), . . . ,Y{gk)) is Gaussian 
with zero mean and covariance matrix {{gi\gj)-H}- The problem is then the 
following: estimate the element h assuming the value of Y can be measured 
for any given 51. It is customary to use as g the eigenvectors of the adjoint T* 
of T. Under the condition that the operator T*T is compact, the SVD yields 
T(j)k = bkipk and T*'0fc = bk4>k^ for A; G N, where bk are the singular values, 
{il^k} is an orthonormal basis in Range(T) C % and {(j)k} is the corresponding 
orthonormal basis in %. In view of (4.1), it holds that 

(4.2) Y{i;k) = {h\<l)k)Hbk + ei{i^k), A; G N. 

Since in practice only a finite number of measurements can be computed, it 
is natural to assume that the values Y{ipk) are available only for k smaller 
than some integer n. Under the assumption that 6^ 7^ 0, the last equation is 
equivalent to (1.1) with /j = {h\(j)i)'^ and S = d\Sig{af,i = 1,2, . . .) for cjj = 
e6^^. Examples of inverse problems to which this statistical model has been 
successfully applied are derivative estimation, deconvolution with known 
kernel, computerized tomography — see [12] and the references therein for 
more applications. 

For very mildly ill-posed inverse problems, that is, when the singular val- 
ues bk of T tend to zero not faster than any negative power of fe, the approach 
presented in Section 2 will lead to satisfactory results. Indeed, by choosing 
/3 = 8||S||| or /3 = 4|||S|||, the remainder term in (2.8) and (2.9) becomes — 
up to a logarithmic factor — proportional to maxi<fc<„ which is the 

optimal rate in the case of very mild ill-posedness. 

However, even for mildly ill-posed inverse problems, the approach devel- 
oped in the previous section becomes obsolete since the remainder blows up 
when n increases to infinity. Furthermore, this is not an artifact of our the- 
oretical results, but rather a drawback of the aggregation strategy adopted 
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in the previous section. Indeed, the posterior probability measure vr defined 
by (2.2) can be seen as the solution of the entropy-penalized empirical risk 
minimization problem: 



(4.3) 



: arginf < / fxp{dX) + — /C(p,7r) 
P Ua n 



where the inf is taken over the set of all probability distributions. It means 
the same regularization parameter (3 is employed for estimating both the 
coefficients fi = {h\(j)i)-}{ corrupted by noise of small magnitude and those 
corrupted by large noise. Since we place ourselves in the setting of known 
operator T and, therefore, known noise levels, such a uniform treatment of all 
coefficients is unreasonable. It is more natural to upweight the regularization 
term in the case of large noise downweighting the data fidelity term and, 
conversely, to downweight the regularization in the case of small noise. This 
motivates our interest in the grouped EWA (or GEWA). 

Let us consider a partition Bi, . . . , Bj of the set { 1 , . . . , n} : Bj = {Tj + 
1, . . . , Tj+i}, for some integers = Ti < r2 < • • • < Tj+i = n. To each element 
Bj of this partition, we associate the data sub-vector Y-' = {Yi:i £ Bj) and 
the sub- vector of true function P = {fi-.i £ Bj). As in previous sections, we 
are concerned by the aggregation of affine estimators = AxY + b;^, but 
here we will assume the matrices Ax are block-diagonal: 



Ay 











with ^{ gRC^^+i' 



Similarly, we define and b-^ as the sub- vectors of and b;^, respectively, 
corresponding to the indices belonging to Bj. We will also assume that the 

noise covariance matrix S and its unbiased estimate S are block-diagonal 
with {Tj^i — Tj) X (Tj+i — Tj) blocks S-^ and S-^, respectively. This notation 

implies, in particular, that = ^^Y-' + b-^ for every j = 1, . . . , J . Moreover, 

the unbiased risk estimate r^"*^ of can be decomposed into the sum of 

unbiased risk estimates r^'"'^'^ of f^, namely, r^"*^ = "^'j^i'T^'^^^ , where 



1 



I Y^ - II + - Tr{^'A{) - - Tr[S^], j = 1, . . . , J. 
n n 

To state the analogues of Theorems 1 and 2, we introduce the following 
settings. 

Setting 1: For all A, A' G A and j £ {1, . . . , J}, A-'^ are symmetric and sat- 
isfy A-'-^A-'y = AyA-'^, A-'-^T,^ + S-'^'^ y and b-^ = 0. For a temperature vector 
(3 = (/3i, . . . , /3j)'^ and a prior vr, we define GEWA as fQE-yy^ = f\ '^i'^^'^id^), 
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where rc^ (dX) = 0^{X)7r{dX) with 

(4.4) e^{x) - -p(--r"V/3,) 



exp(-nfi'''"'^//3j)7r(da;) 

Setting 2: For every j = 1, . . . , J and for every A belonging to a^et of vr- 
measure one, the matrices Ax satisfy a.s. the inequality Tr(S-^74-^) < 
Tr(E^ {A{y A{) while the vectors are such that A!^^h{ = (A^)'^^ = 0. In 
this case, for a temperature vector /3 = (/3i, . . . , /3 j)"*" and a prior vr, we define 
GEWA as f^EWA = /a fi ('^^) ' ^^ere = (^ + {A{y - {A{y A^,)Y^ +h{ 
and TT^ is defined by (4.4). Note that this setting is the grouped version of 
the SEWA. 

Theorem 3. Assume that S is unbiased and independent ofY. Under 
setting 1, if f3j > 8|||S-' ||| for all j = 1, . . . , J , then 

(4.5) E[||fGEWA-f||^]<5^mfj/' E\\ii-P\\lp,{dX) + ^}C{p„7r)]. 

j^i Pi kJa n ) 

Under setting 2, this inequality holds true if (ij > 4|||S-'||| for every j = 
1,...,J. 

As we shall see in Section 6, this theorem allows us to propose an estimator 
of the unknown signal which is adaptive w.r.t. the smoothness properties of 
the underlying signal and achieves the minimax rates and constants over the 
Sobolev ellipsoids provided that the operator T is mildly ill-posed, that is, 
its singular values decrease at most polynomially. 

5. Examples of sharp oracle inequalities. In this section we discuss con- 
sequences of the main result for specific choices of prior measures. For con- 
veying the main messages of this section it is enough to focus on settings 1 
and 2 in the case of only one group ( J = 1). 

5.1. Discrete oracle inequality. In order to demonstrate that inequality 
(4.5) can be reformulated in terms of an 01 as defined by (2.1), let us consider 
the case when the prior vr is discrete, that is, vr(Ao) = 1 for a countable set 
Aq C a, and w.l.o.g Aq = N. Then, the following result holds true. 

Proposition 1. Let T, be unbiased, independent of Y and vr be sup- 
ported by N. Under setting 1 with J = 1 and (3 = f3i> 8|||S|||, the aggregate 
fcEWA satisfies the inequality 

(5.1) E[||fGEWA-f||^]<_inf fE[||f,-f||2] + ^^°g(^/^^^~ 



Furthermore, (5.1) holds true under setting 2 for [3 > 4|||S|||. 
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Proof. It suffices to apply Theorem 3 and to upper-bound the right- 
hand side by the minimum over all Dirac measures p = di such that vr^ > 0. 
□ 

This inequality can be compared to Corollary 2 in [5], Section 4.3. Our 
result has the advantage of having factor one in front of the expectation 
of the left-hand side, while in [5] a constant much larger than 1 appears. 
However, it should be noted that the assumptions on the (estimated) noise 
covariance matrix are much weaker in [5]. 

5.2. Continuous oracle inequality. It may be useful in practice to com- 
bine a family of affine estimators indexed by an open subset of M*''^ for some 
M € N (e.g., to build an estimator nearly as accurate as the best kernel 
estimator with fixed kernel and varying bandwidth). To state an oracle in- 
equality in such a "continuous" setup, let us denote by d2(A,5A) the largest 
real r > such that the ball centered at A of radius r — hereafter denoted 
by B\{t) — is included in A. Let Leb(-) be the Lebesgue measure in M*^. 

Proposition 2. Let S be unbiased, independent of Y. Let A C M*^ 
be an open and bounded set and let vr be the uniform distribution on A. 
Assume that the mapping X*-^ rx is Lipschitz continuous, that is, \ry —rx\ < 
Lfll A' — AII2, VA, A' G A. Under setting 1 with J = 1 and P = Pi>8\\\Ti\\\, the 
aggregate fcEWA satisfies the inequality 

EllfGEWA - f 11^, < inf {E[||f, - f + ^ logf . ) I 

AeA [ n \2mm(77, ^, a2(A, aAj) / J 

^^•^^ ^ L, + /31og(Leb(A)) 

n 

Furthermore, (5.2) holds true under setting 2 for every (3 > 4|||S|||. 

Proof. It suffices to apply assertion (i) of Theorem 1 and to upper- 
bound the right-hand side in inequality (2.8) by the minimum over all mea- 
sures having as density Pa*,t*(A) = 1^;,^, (T-.)(A)/Leb(i?;^*(r*)). Choosing 
r* = min(n~^, (i2(A*, (9A)) such that Bx*{t*) C A, the measure pA*,T*(A)(iA 
is absolutely continuous w.r.t. the uniform prior vr and the Kullback-Leibler 
divergence between these two measures equals log{Leb(A)/Leb(i?;^*(T*))}. 
Using Leb(5A*(T*)) > (2r*/\/M)^'^ and the Lipschitz condition, we get the 
desired inequality. □ 

Note that it is not very stringent to require the risk function r\ to be 
Lipschitz continuous, especially since this condition needs not be satisfied 
uniformly in f . Let us consider the ridge regression: for a given design matrix 
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X G M"^P, Ax = + 7„A/„xn)~^^^ and bA = with A G [A*, A*], 

7.„ being a given normaUzation factor typically set to n or -y/n, A^, > and 
A* G [A*,oo]. One can easily check the Lipschitz property of the risk function 
with Lr = Lrif) = 4A-i||f ||2 + (2/n) Tr(S). 

5.3. Sparsity oracle inequality. The continuous oracle inequality stated 
in the previous subsection is well adapted to the problems in which the 
dimension M of A is small w.r.t. the sample size n (or, more precisely, the 
signal to noise ratio n/|||S|||). When this is not the case, the choice of the 
prior should be done more carefully. For instance, consider A C M.^'^ with 
large M under the sparsity scenario: there is a sparse vector A* G A such 
that the risk of f;^* is small. Then, it is natural to choose a prior that favors 
sparse A's. This can be done in the same vein as in [21-24], by means of the 
heavy tailed prior, 

M 

(5.3) vr(dA) oc J] TTTJTT^^^^^^^ 

m=l U + l-^m/ri ) 

where r > is a tuning parameter. 

Proposition 3. Let S be unbiased, independent ofY. Let A = and 
let vr be defined by (5.3). Assume that the mapping X^rx is continuously 
differentiable and, for some M x M matrix M, satisfies 

(5.4) rA-rv-VrX,(A-A')<(A-A')^M(A-A') VA,A'gA. 

Under setting 1 if f3 > 8|||S|||, then the aggregate fswA = fcEWA satisfies 

E[||fGEWA - f 11^] < inf llEllfA + l°gf 1 + ^) I 

y m=l ^ ^ J 

(5.5) 

Moreover, (5.5) holds true under setting 2 if P > 4|||S|||. 

Let us discuss here some consequences of this sparsity oracle inequality. 
First of all, consider the case of (linearly) combining frozen estimators, that 
is, when = X]j=i -^jVi with some known functions ipj. Then, it is clear that 
'^A - rx' - VrJ, (A - A') = 2(A - A')'''$(A - A'), where $ is the Gram matrix 
defined by = {ipi\Lpj)n. So the condition in Proposition 3 consists in 
bounding the Gram matrix of the atoms (pj . Let us remark that in this case — 
see, for instance, [22, 23] — Tr{Ai) is on the order of M and the choice r = 

f5/{nM) ensures that the last term in the right-hand side of equation (5.5) 
decreases at the parametric rate 1/n. This is the choice we recommend for 
practical applications. 
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As a second example, let us consider the case of a large number of linear 
estimators gi = GiY, . . . ,gM = Gm^ satisfying conditions of setting 1 and 
such that maxm=i^...^j,f III III < 1- Assume we aim at proposing an estimator 
mimicking the behavior of the best possible convex combination of a pair 
of estimators chosen among gi, • • • , gj\/. This task can be accomplished in 
our framework by setting A = M^^ and ix = Aigi + • • • + XmSm, where A = 
(Ai, . . . ,Xm). Remark that if {gm} satisfies conditions of setting 1, so does 
{ix}. Moreover, the mapping A i— t- r;^ is quadratic with Hessian matrix V'^rx 
given by the entries 2{Gmi\Gm'i)n + f Tr(Gm'SGm)) m,m' = 1, . . . ,M. It 
implies that inequality (5.4) holds with A4 = \/'^r\/2. Therefore, denoting by 
af the ith diagonal entry of S and setting a = (cji, . . . , c7„), we get Tr(7W) < 

IIIE^iiC^IIKIIflln + Iklln] < ^[l|f|ln + Iklln]- Applying Proposition 3 with 
T = Y^/3/(nM), we get 



E[||fEWA - f lln] < inf mWagm + (1 - «)gm' " f ||'] 



(5.6) 



a,m,m' 



+ — log 1 + 



n 



Mn 

T 



1/2 



+ ^ll|f||^ + "~"^ 



n 



nl ' 



where the inf is taken over all a G [0, 1] and m,m' € {1, . . . ,M}. This in- 
equality is derived from (5.5) by upper-bounding the inf_j^g][jM by the infi- 
mum over A's having at most two nonzero coefficients, and A^^, that 
are nonnegative and sum to one: A„^ + A^^^ = 1- To get (5.6), one simply 

notes that only two terms of the sum log(l + |Am|T~^) are nonzero and 
each of them is not larger than log(l + t~^). Thus, one can achieve using 
EWA the best possible risk over the convex combinations of a pair of lin- 
ear estimators — selected from a large (but finite) family — at the price of a 
residual term that decreases at the parametric rate up to a log factor. 

5.4. Oracle inequalities for varying-block-shrinkage estimators. Let us 
consider now the problem of aggregation of two-block shrinkage estima- 
tors. This means that the constituent estimators have the following form: 
for A= {a,b,k) e [0,1]^ x {!,... ,n} := A, fA = where = diag(al(? < 
k) + bl{i > k),i = 1, . . . ,n). Let us choose the prior vr as uniform on A. 

Proposition 4. Let Tewa be the exponentially weighted aggregate hav- 
ing as constituent estimators two-block shrinkage estimators AxY. If S is 
diagonal, then for any A G A and for any (3 > 8|||S|||, 

(5.7) E[||fEWA - f 11^] < - f lln] + ^{l + log( "'"^"\y^'^^^ 

In the case S = /nxm this result is comparable to [44], page 20, The- 
orem 2.49, which states that in the homoscedastic regression model (S = 
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Inxn), EWA acting on two-block positive-part James-Stein estimators sat- 
isfies, for any A G A such that 3 < k < n — 3 and for (3 = 8, the oracle in- 
equality 

(5.8) E[||fLeung - f 11^] < - f\\l] + 1 + 1 mil^l^ V (log !^ - 1^ | . 

6. Application to minimax adaptive estimation. Pinsker proved in his 
celebrated paper [49] that in the model (1.1) the minimax risk over ellip- 
soids can be asymptotically attained by a linear estimator. Let us denote by 
9k{i) = (f|<^A:)n the coefficients of the (orthogonal) discrete cosine^ (DCT) 
transform of f, hereafter denoted by Pf. Pinsker's result — restricted to 
Sobolev ellipsoids Fv{a,R) = {ieW: Y,l^^ k^"Okiif < R}— states that, 
as n — )• oo, the equivalences 

(6.1) inf sup E[||f-f||2] ~inf sup E[PY - f ||2] 



.2) ~ inf sup E[\\Aa^aY 



InJ 



hold [61], Theorem 3.2, where the first inf is taken over all possible estima- 
tors f and Aa^w = T^~^ diag((l — /w)j^\ = 1, . . . , n)T> is the Pinsker filter in 
the discrete cosine basis. In simple words, this implies that the (asymptoti- 
cally) minimax estimator can be chosen from the quite narrow class of linear 
estimators with Pinsker's filter. However, it should be emphasized that the 
minimax linear estimator depends on the parameters a and R, that are gen- 
erally unknown. An (adaptive) estimator, that does not depend on {a,R) 
and is asymptotically minimax over a large scale of Sobolev ellipsoids, has 
been proposed by Efromovich and Pinsker [29]. The next result, that is, a 
direct consequence of Theorem 1, shows that EWA with linear constituent 
estimators is also asymptotically sharp adaptive over Sobolev ellipsoids. 

Proposition 5. Let A = (q, w) G A = and consider the prior 

2„-"/(2"+i) 

(6.3) vrfdA) = — e~°'dadw, 

where = nja^ . Then, in model (1.1) with homoscedastic errors, the aggre- 
gate fEWA based on the temperature /3 = So"^ and the constituent estimators 
fa,w) = ^o,«)Y (with Aa^w being the Pinsker filter) is adaptive in the exact 
minimax sense^ on the family of classes { J-x>(a, i?) : a > 0, i? > 0} . 



■^The results of this section hold true not only for the discrete cosine transform, but 
also for any linear transform T> such that = T>^T> = n~^7„xn. 

"See [61], Definition 3.8. 
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It is worth noting that the exact minimax adaptivity property of our 
estimator fswA is achieved without any tuning parameter. All previously 
proposed methods that are provably adaptive in an exact minimax sense 
depend on some parameters such as the lengths of blocks for blockwise 
Stein [14] and Efromovich-Pinsker [28] estimators or the step of discretiza- 
tion and the maximal value of bandwidth [13]. Another nice property of the 
estimator fewA is that it does not require any pilot estimator based on the 
data splitting device [31]. 

We now turn to the setup of heteroscedastic regression, which corresponds 
to ill-posed inverse problems as described in Section 4. To achieve adaptivity 
in the exact minimax sense, we make use of fcEWA, the grouped version of the 
exponentially weighted aggregate. We assume hereafter that the matrix S is 
diagonal with diagonal entries af, . . . ,cj^ satisfying the following property: 

(6.4) 3o-*,7>0 such that cr^ = o-^A:2'^(l + Ofc(l)) as A; oo. 

This kind of problems arises when T is a differential operator or the Radon 
transform [12], Section 1.3. To handle such situations, we define the groups 
in the same spirit as the weakly geometrically increasing blocks in [15]. Let 

— 1/3 

V = Vn be a positive integer that increases as n — )• oo. Set Pn = Vn and 
define 

f65) ^ ^ r(l + ^n)^-^-l, j = l,2, 

' \Tj^l+[VnPn{l+Pny~^\, J = 3, 4, . . . , 

where [xj stands for the largest integer strictly smaller than x. Let J be 
the smallest integer j such that Tj > n. We redefine Tj+i = n and set Bj = 
{Tj + 1,..., Tj+i} for ah j = 1, . . . , J. 

Proposition 6. Let the groups Bi,. . . ,Bj be defined as above with Vn 
satisfying log i^^/ log n — )■ oo and — )• oo as n ^ oo. Let A = {a,w) G A = 
and consider the prior 

2„-a/{2a+27+l) 

(6.6) n(dX) = ,„ — —e ""dadw. 

Then, in model (1.1) with diagonal covariance matrix S = diag((T^; 1 < A; < 
n) satisfying condition (6.4), the aggregate fcEWA (under setting 1) based 
on the temperatures f3j = Smaxjgs^. af and the constituent estimators ta,w = 
Aa^^Y (with Aa^w being the Pinsker filter) is adaptive in the exact minimax 
sense on the family of classes {J-{a, R) : a > 0, R> 0} . 



Note that this result provides an estimator attaining the optimal constant 
in the minimax sense when the unknown signal lies in an ellipsoid. This 
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property holds because minimax estimators over the eUipsoids are Hnear. For 
other subsets of M", such as hyper-rectangles, Besov bodies and so on, this is 
not true anymore. However, as proved by Donoho, Liu and MacGibbon [27], 
for orthosymmetric quadratically convex sets the minimax linear estimators 
have a risk which is within 25% of the minimax risk among all estimates. 
Therefore, following the approach developed here, it is also possible to prove 
that GEWA can lead to an adaptive estimator whose risk is within 25% of 
the minimax risk, for a broad class of hyperrectangles. 

7. Experiments. In this section we present some numerical experiments 
on synthetic data, by focusing only on the case of homoscedastic Gaussian 
noise (S = a^Inxn) with known variance. A toolbox is made available freely 
for download at http://josephsalmon.eu/code/index_codes.php. Addi- 
tional details and numerical experiments can be found in [19, 55]. 

We evaluate different estimation routines on several ID signals considered 
as a benchmark in the literature on signal processing [25]. The six signals 
we retained for our experiments because of their diversity are depicted in 
Figure 1. Since these signals are nonsmooth, we have also carried out ex- 
periments on their smoothed versions obtained by taking the antiderivative. 
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(a) Test signals in Experiment I 





(b) Test signals in Experiment II 



Fig. 1. Test signals used in our experiments: Piece-Regular, Ramp, Piece-Polynomial, 
HeaviSine, Doppler and Blocks, (a) nonsmooth (Experiment I) and (b) smooth (Experi- 
ment II). 
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Experiments on nonsmooth (resp., smooth) signals are referred to as Exper- 
iment I (resp., Experiment II). In both cases, prior to applying estimation 
routines, we normalize the (true) sampled signal to have an empirical norm 
equal to one and use the DCT denoted by 0(Y) = ((9i(Y), . . . , 9n{Y)y . 

The four tested estimation routines — including EWA — are detailed below. 

Soft- Thresholding (ST) [25]: For a given shrinkage parameter t, the soft- 
thresholding estimator is 6^ = sgn(0fc(Y))(|0fc(Y)| — at)^. We use the data- 
driven threshold minimizing the Stein unbiased risk estimate [26]. 

Blockwise James-Stein (BJS) shrinkage [10]: The set is par- 

titioned into N = [n/log(n)] blocks Bi, B2, ■ ■ ■ , of nearly equal size L. 
The corresponding blocks of true coefficients 0Bk{^) = {^ji^))j&Bk are then 
estimated by Ob^, = (1 — g^^^y-^ )+^_Bfc (Y), k = 1, . . . ,N , with blocks of noisy 

coefficients ^b,(Y), S| = ||^b,(Y)||2 and A = 4.50524. 

Unbiased risk estimate (URE) minimization with Pinsker's filters [13]: 
Pinsker filter with data-driven parameters a and w selected by minimizing 
an unbiased estimate of the risk over a suitably chosen grid for the values 
of a and w. Here, we use geometric grids ranging from 0.1 to 100 for a and 
from 1 to n for w. 

EWA on Pinsker's filters: We consider the same finite family of linear 
filters (defined by Pinsker's filters) as in the URE routine described above. 
According to Proposition 1, this leads to an estimator nearly as accurate as 
the best Pinsker's estimator in the given family. 

To report the result of our experiments, we have also computed the best 
linear smoother, hereafter referred to as the oracle, based on a Pinsker filter 
chosen among the candidates that we used for defining URE and EWA. By 
best smoother we mean the one minimizing the squared error (it can be 
computed since we know the ground truth). Results summarized in Table 1 
for Experiment I and Table 2 for Experiment II correspond to the average 
over 1000 trials of the mean squared error (MSE) from which we subtract 
the MSE of the oracle and multiply the resulting difference by the sample 
size. We report the results for a = 0.33 and for n G {2^, 2^, 2^*^, 2^^}. 

Simulations show that EWA and URE have very comparable perfor- 
mances and are significantly more accurate than soft-thresholding and block 
James-Stein (cf. Table 1) for every size n of signals considered. Improve- 
ments are particularly important when signals have large peaks or discon- 
tinuities. In most cases, EWA also outperforms URE, but differences are 
less pronounced. One can also observe that for smooth signals, the differ- 
ence of MSEs between EWA and the oracle, multiplied by n, remains nearly 
constant when n varies. This is in agreement with our theoretical results in 
which the residual term decreases to zero inversely proportionally to n. 

Of course, soft-thresholding and blockwise James-Stein procedures have 
been designed for being applied to the wavelet transform of a Besov smooth 
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Table 1 

Evaluation of 4 adaptive methods on 6 (nonsmooth) signals. For each sample size and 
each method, we report the average value of n(MSE — MSEoracio) and the corresponding 
standard deviation (in parentheses) , for 1000 replications of the experiment 



n 


EWA 


URE 


BJS 


ST 


EWA 


URE 


BJS 


ST 






Blocks 






Doppl 


er 




256 


0.051 


0.245 


9.617 


4.846 


0.062 


0.212 


13.233 


6.036 






(0.39) 


(1.78) 




l^U.oO ) 


(0.31) 


(2.11) 




512 


—0.052 


0.302 


13.807 


9.256 


—0.100 


0.205 


17.080 


12.620 




t^U.oO ) 


(0.50) 


(2.16) 




t^U.oU ) 


(0.39) 


(2.29) 




1024 


—0.050 


0.299 


19.984 


17.569 


—0.107 


0.270 


21.862 


93 006 






(0.46) 


(2.68) 




( r\ Qf^^ 

l^U.oO ) 


(0.41) 


(2.92) 




2048 


—0.007 


0.362 


28.948 


30.447 


—0.150 


0.234 


28.733 


38.671 




{V)Al) 


(0.57) 


(3.31) 


l^z.yo } 




(0.42) 


(3.19) 


yo.uz ) 






HeaviSine 






Piece-Regular 




256 


-0.060 


0.247 


1.155 


3.966 


-0.069 


0.248 


8.883 


4.879 




(0.19) 


(0.42) 


(0.57) 


(1.12) 


(0.32) 


(0.40) 


(1.76) 


(1.20) 


512 


-0.079 


0.215 


2.064 


5.889 


-0.105 


0.237 


12.147 


9.793 




(0.19) 


(0.39) 


(0.86) 


(1.36) 


(0.30) 


(0.37) 


(2.28) 


(1.64) 


1024 


-0.059 


0.240 


3.120 


8.685 


-0.092 


0.291 


15.207 


16.798 




(0.23) 


(0.36) 


(1.20) 


(1.64) 


(0.34) 


(0.46) 


(2.18) 


(2.13) 


2048 


-0.051 


0.278 


4.858 


12.667 


-0.059 


0.283 


21.543 


27.387 




(0.25) 


(0.48) 


(1.42) 


(2.03) 


(0.34) 


(0.54) 


(2.47) 


(2.77) 






Ramp 






Piece-Polynomial 




256 


0.038 


0.294 


6.933 


5.644 


0.017 


0.203 


12.201 


3.988 




(0.37) 


(0.47) 


(1.54) 


(1.20) 


(0.37) 


(0.37) 


(1.81) 


(1.19) 


512 


0.010 


0.293 


9.712 


9.977 


-0.078 


0.312 


17.765 


9.031 




(0.36) 


(0.51) 


(1.76) 


(1.67) 


(0.35) 


(0.49) 


(2.72) 


(1.62) 


1024 


-0.002 


0.300 


13.656 


16.790 


-0.026 


0.321 


23.321 


17.565 




(0.30) 


(0.45) 


(2.25) 


(2.06) 


(0.38) 


(0.48) 


(2.96) 


(2.28) 


2048 


0.007 


0.312 


19.113 


27.315 


-0.007 


0.314 


31.550 


29.461 




(0.34) 


(0.50) 


(2.68) 


(2.61) 


(0.41) 


(0.49) 


(3.05) 


(2.95) 



function, rather than to the Fourier transform of a Sobolev-smooth function. 
However, the point here is not to demonstrate the superiority of EWA as 
compared to ST and BJS procedures. The point is to stress the importance of 
having sharp adaptivity up to an optimal constant and not simply adaptivity 
in the sense of rate of convergence. Indeed, the procedures ST and BJS are 
provably rate-adaptive when applied to the Fourier transform of a Sobolev- 
smooth function, but they are not sharp adaptive — they do not attain the 
optimal constant — whereas EWA and URE do attain. 

8. Summary and future work. In this paper we have addressed the prob- 
lem of aggregating a set of affine estimators in the context of regression 
with fixed design and heteroscedastic noise. Under some assumptions on the 
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Table 2 

Evaluation of 4 adaptive methods on 6 smoothed signals. For each sample size and each 
method, we report the average value o/ 7i(MSE — MSEoracio) cind the corresponding 
standard deviation (in parentheses), for 1000 replications of the experiment 



n 


EWA 


URE 


BJS 


ST 


EWA 


URE 


BJS 


ST 






Blocks 






DoppL 


er 




256 


0.387 


0.216 


0.216 


2.278 


0.214 


0.237 


1.608 


2.777 






(0.40) 


(0.24) 






(0.40) 


(0.73) 




512 


0.170 


0.209 


0.650 


3.193 


0.165 


0.250 


1.200 


3.682 




t^u.zu ) 


(0.41) 


(0.25) 




t^U.zU ) 


(0.44) 


(0.48) 




1024 


0.162 


0.226 


1.282 


4.507 


0.147 


0.229 


1.842 


5.043 






(0.41) 


(0.44) 




/n 1 o\ 


(0.45) 


(0.86) 




2048 


0.120 


0.220 


1.574 


6.107 


0.138 


0.229 


1.864 


6.584 






(0.37) 


(0.55) 




/fl 9n\ 


(0.40) 


(1.07) 








HeaviSine 






Piece-Regular 




256 


0.217 


0.207 


1.399 


2.496 


0.269 


0.279 


2.120 


2.053 




(0.16) 


(0.42) 


(0.54) 


(0.96) 


(0.27) 


(0.49) 


(1.09) 


(0.95) 


512 


0.206 


0.221 


0.024 


3.045 


0.216 


0.248 


2.045 


2.883 




(0.18) 


(0.43) 


(0.26) 


(1.10) 


(0.20) 


(0.45) 


(1.17) 


(1.13) 


1024 


0.179 


0.200 


0.113 


3.905 


0.183 


0.228 


1.251 


3.780 




(0.18) 


(0.50) 


(0.27) 


(1.27) 


(0.20) 


(0.41) 


(0.70) 


(1.37) 


2048 


0.162 


0.189 


0.421 


5.019 


0.145 


0.223 


1.650 


4.992 




(0.15) 


(0.37) 


(0.27) 


(1.53) 


(0.19) 


(0.42) 


(1.12) 


(1.42) 






Ramp 






Piece-Polynomial 




256 


0.162 


0.200 


0.339 


2.770 


0.215 


0.257 


1.486 


2.649 




(0.16) 


(0.38) 


(0.24) 


(1.00) 


(0.25) 


(0.48) 


(0.68) 


(1.01) 


512 


0.150 


0.215 


0.425 


3.658 


0.170 


0.243 


1.865 


3.683 




(0.18) 


(0.38) 


(0.23) 


(1.20) 


(0.20) 


(0.46) 


(0.84) 


(1.20) 


1024 


0.146 


0.211 


0.935 


4.815 


0.179 


0.236 


1.547 


5.017 




(0.18) 


(0.39) 


(0.33) 


(1.35) 


(0.20) 


(0.47) 


(1.02) 


(1.38) 


2048 


0.141 


0.221 


1.316 


6.432 


0.165 


0.210 


2.246 


6.628 




(0.20) 


(0.43) 


(0.42) 


(1.54) 


(0.20) 


(0.39) 


(1.15) 


(1.70) 



constituent estimators, we have proven that EWA with a suitably chosen 
temperature parameter satisfies PAC-Bayesian type inequaUty, from which 
different types of oracle inequalities have been deduced. All these inequal- 
ities are with leading constant one and rate-optimal residual term. As an 
application of our results, we have shown that EWA acting on Pinsker's 
estimators produces an adaptive estimator in the exact minimax sense. 

Next in our agenda is carrying out an experimental evaluation of the pro- 
posed aggregate using the approximation schemes described by Dalalyan and 
Tsybakov [23], Rigollet and Tsybakov [52, 54] and Alquier and Lounici [1], 
with a special focus on the problems involving large scale data. 

Although we do not assume the covariance matrix S of the noise to be 
known, our approach relies on an unbiased estimator of S which is indepen- 
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dent on the observed signal and on an upper bound on the largest singular 
value of S. In some applications, such information may be hard to obtain 
and it can be helpful to relax the assumptions on S. This is another interest- 
ing avenue for future research for which, we believe, the approach developed 
by Giraud [34] can be of valuable guidance. 

APPENDIX: PROOFS OF MAIN THEOREMS 
We develop now the detailed proofs of the results stated in the manuscript. 

A.l. Stein's lemma. The proofs of our main results rely on Stein's lem- 
ma [59], recalled below, providing an unbiased risk estimate for any estimator 
that depends sufficiently smoothly on the data vector Y. 

Lemma 1. iei Y be a random vector drawn form the Gaussian distribu- 
tion J\fn{i,'^)- If the estimator f is a.e. differentiable in Y and the elements 
of the matrix V • f""" := {difj) have finite first moment, then 

f = II Y - f ||2 + - Tr[S(V • F)] - - Tr[S] 

is an unbiased estimate of r, that is, E[f] =r. 

The proof can be found in [61], page 157. We apply Stein's lemma to the 
affine estimators = A\Y + hx, with A\ an n x n deterministic real matrix 
and G M"' a deterministic vector. We get that if^S is an unbiased estimator 
of S, then r^^ = \\Y -ix\\l + l - ^ Tr[S] is an unbiased estimator 

of the risk rx = n\h " f lln] = II (A - /nxn)f + hxWl + I TrlA^SAj] . 

A. 2. An auxiliary result. Prior to proceeding with the proof of the main 
theorems, we prove an important auxiliary result which is the central ingre- 
dient of the proofs for our main results. 

Lemma 2. Let assumptions of Lemma 1 be satisfied. Let {TaiAe A} be 
a family of estimators of f and {fx : A S A} a family of risk estimates such 
that the mapping Y i— t- (f^, r^) is a.e. differentiable for every A € A. Let f^^^ 
be the unbiased risk estimate of given by Stein's lemma. 

(1) For every vr G Pa CLnd for any /3 > 0, the estimator fswA defined as 
the average of fx w.r.t. to the probability measure 

7r(Y, dX) = e{Y, X)-K{dX) with e{Y, A) oc exp{-nfx{Y)/ 13} 

admits 

TEWA = ^(^r^ - l|fA - fEWAll' - j{VYfx\^{h " fEWA))„) vr(dA) 

as unbiased estimator of the risk. 
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(2) //, furthermore, fx > , VA € A and f^{n\/ Y^xl^i^. 



X 



fEWA))n7r(dA) > —afj^ — {EWA\\n'^{dX) for some constant a > 0, then for 
every /3 >2a it holds that 

(A.l) E[||fEWA - f||^] < inf I / E[f,]p(dA) + ^^1^1 

pe^A [J A n } 

Proof. According to the Stein lemma, the quantity 

(A.2) fEWA = II Y - fEWAll' + -TVp(V • fEWA(Y))] - -Tr[S] 

n n 

is an unbiased estimate of the risk r„ = E[||fEWA — f ||n]- Using simple algebra, 
one checks that 

(A.3) ||Y-fEWA||'= Xv(||Y-fA||2 - ||f;,-fEWA||^)vr((iA). 

By interchanging the integral and differential operators, we get the following 
relation: a.iEWA,, = Xv{(a,jAj(Y))0(Y, A) + A,,(Y)(a,,0(Y, A))}7r((iA). 
Then, combining this equality with equations (A.2) and (A.3) implies that 

TEWA = / {fT'' - \\h - fEWAWlMdX) + - [ Tr[SfAVY0(Y, X)'^]7T{dX). 

J A n J A 

After having interchanged differentiation and integration, we obtain that 
/a fEWA (Vy^(Y, A))^7r(dA) = fEWA VY(/Ae(Y,A)7r(dA)) = and, therefore, 
we come up with the following expression for tewa: 

rEWA= [ (fr''-||fA-f„||^ + 2(VYlog^(A)|S(fA-fEWA))j7r(dA) 
J A 

= [ (^r^ - llfA - fEWAll^ - 2n/?-i(VYrA|S(fA - fEWA))Jvr(dA). 
J A 

This completes the proof of the first assertion of the lemma. 

To prove the second assertion, let us observe that under the required 
condition and in view of the first assertion, for every /3 > 2o it holds that 
rEWA < Jj,f'f>Tt{dX) < J^fxTT{dX) < fjyfx7r{dX) + ^IC{tt,tt). To conclude, it 
suffices to remark that vr is the probability measure minimizing the criterion 
/a ^\p{dX) + £/C(p, vr) among all p G V\. Thus, for every p G "Pa, we have 

fEWA< / fxP{dX) + -JC{p,7r). 
J A n 

Taking the expectation of both sides, the desired result follows. □ 
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A.3. Proof of Theorem 1. 

Assertion (i). In what follows, we use the matrix shorthand / = Inxn 
and ^EWA — /a ^A7r((iA). We apply Lemma 2 with fx = f^^^. To check the 
conditions of the second part of Lemma 2, note that in view of equations 
(2.4) and (2.6), as well as the assumptions Aj^ = Ax and Ax'hx = 0, we get 



A 

Vvrr^ = -(/ - AxVil - Ax)Y - -{I - Ax^hx = -(/ - AxfY - -hx. 

n n n n 



Recall now that for any pair of commuting matrices P and Q the identity 
(/ - P)2 = (/ - Q)2 + 2(1 - ^){Q - P) holds true. Applying this iden- 
tity to P = Ax and Q = ^ewa (in view of the commuting property of the 
^a's), we get the following relation: ((/ - Aa)^Y|S(Aa - ^ewa)Y)„ = ((/- 
^EWa)'Y|S(Aa - AeWa)Y)„ - 2((/ - .iEW|+^)(AEWA - Aa)Y|S(^ewa - 
j4a)Y)„. When one integrates over A with respect to the measure the 
term of the first scalar product in the right-hand side of the last equation 
vanishes. On the other hand, 

(^a(^ewa - Aa)Y|S(Aewa - A)Y)„ 

= (^A(fEWA - fA)|S(fEWA " fA))n 
= ((fEWA - fA)|^AS(fEWA " ^a)), 

= 7^(fEWA - fA)^(AAS + S^A)(fEWA " ^a) > 0. 
Zn 

Since positive semi-definiteness of matrices TiAx + ^a^^ implies the one of 
the matrix SAewa + S ^ewa j we also have (^ewa (^ewa — ^a ) Y | S (tIewa — 
^a)Y)„ > 0. Therefore, 

' - (^^^^ _ ^^)y|S(^ewa - Ax)Y 

<((fEWA-fA)|S(fEWA-fA)>„ 

= \\T.'/\h^^-ix)\t 

This inequality implies that 

/ (nVYrr^|S(fA-fEWA))„vr(dA)>-4 / \\T}'\ix-im^K)\\lT^{d\). 
J A J A 

Therefore, the claim of Theorem 1 holds true for every /3 > 8|||S|||. 

Assertion (ii). Let now fA = ^aY + bA with symmetric ^a :^ Inxn and 

bA G Ker(^A)- Using the definition fl^^ = f^°'' + ^Y~^{Ax - Al)Y, one easily 

checks that f'^^ > f'^"^ for every A and that 

[ (nVffj|S(fA-fEWA))„vr((iA)= [ (2(Y - fA)|S(fA - fEWA))„7r(dA) 

J A J A 
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= -2 



^^/\ix-fEWA)\\lHdX). 



Therefore, if /3 > 4|||E|||, all the conditions required in the second part of 
Lemma 2 are fulfilled. Applying this lemma, we get the desired result. 

A. 4. Proof of Theorem 2. We apply the result of assertion (ii) of Theo- 
rem 1 to the prior ^{dX) replaced by the probability measure proportional 



Condition (C) entails that the last term is always nonnegative and the result 
follows. 

A. 5. Proof of Theorem 3. Let us place ourselves in setting 1. It is clear 



that E[||fGEWA - fll^J = E/=iE[||f4EWA - f'lln]- For each j G {!,..., J}, 



since /3j > SHIS-'II, one can apply assertion (i) of Theorem 1, which leads 
to the desired result. The case of setting 2 is handled in the same manner. 

Acknowledgment. The authors thank Pierre Alquier for fruitful discus- 
sions. 
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Proofs of some propositions (DOI: 10.1214/12-AOS1038SUPP; .pdf). In 
this supplement we present the detailed proofs of Propositions 2-6. 
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